Abstract

We  describe  how  to  improve  the  performance  of 
MDP  planning  algorithms  by  modifying  them  to 
use  the  search-control  mechanisms  of  planners 
such as TLPlan, SHOP2, and TALplanner. In our
experiments,  modified  versions  of  RTDP,  LRTDP, 
and  Value  Iteration  were  exponentially  faster  than 
the  original  algorithms.  On  the  largest  problems 
the  original  algorithms  could  solve,  the  modified 
ones  were  about  10,000  times  faster.  On  another 
set  of  problems  whose  state  spaces  were  more  than 
14,000  times  larger  than  the  original  algorithms 
could  solve,  the  modified  algorithms  took  only 
about  1/3  second. 

Introduction 

Planning algorithms for MDPs typically suffer from
severe efficiency problems because they need to explore
all or most of the state space. For complex planning
problems, the state space can be huge.
For planning problems expressed using probabilistic
STRIPS operators (Hanks & McDermott 1993;
Kushmerick, Hanks, & Weld 1994) or 2TBNs (Hoey
et al. 1999; Boutilier & Goldszmidt 1996), planning
is EXPTIME-hard (Littman 1997). This paper focuses
on a way to improve the efficiency of planning
on MDPs by adapting the techniques used in
domain-configurable classical planners.

A domain-configurable planner consists of a
domain-independent search engine that can make
use of domain-specific (but problem-independent)
search-control knowledge that is given to the planner
as part of its input. Examples include planners
such as TLPlan (Bacchus & Kabanza 2000) and
TALplanner (Kvarnström & Doherty 2001), in which
the search-control knowledge consists of pruning
rules written in temporal logic, and Hierarchical
Task Network (HTN) planners such as SIPE-2
(Wilkins 1990), O-Plan (Currie & Tate 1991),
and SHOP2 (Nau et al. 2003), in which the
search-control knowledge consists of HTN “methods”
(task-decomposition templates).
Domain-configurable planners have been highly
successful. In the AI planning competitions, such
planners consistently worked in most domains,
solved the most problems, and solved them fastest
(Bacchus 2001; Fox & Long 2002). They are used in
a large variety of real-world applications (Wilkins
1990; Currie & Tate 1991; Nau et al. 2003).

Our  contributions  are  as  follows: 

• We describe how to modify any forward-chaining
MDP planning algorithm by incorporating into
it the search-control algorithm from any
forward-chaining domain-configurable planner.

• We describe conditions under which our modified
MDP planning algorithms are guaranteed
to find optimal answers, and conditions under
which they can do so exponentially faster than
the original MDP planners.

• We have applied our approach to Real-Time Dynamic
Programming (RTDP) (Bonet & Geffner
2000), Labeled RTDP (LRTDP) (Bonet & Geffner
2003), and a forward-chaining version of Value
Iteration (Bertsekas 1995). Our experimental results
show our modified algorithms running
exponentially faster than the original ones. On the
largest problems the original algorithms could
solve, the modified ones ran about 10,000 times
faster. In only about 1/3 second, the modified
algorithms could solve problems whose state spaces
were more than 14,000 times larger.

Background 

Domain-Configurable Classical Planners.

We use the usual definition of a classical planning
problem. In Fig. 1, Controlled-Plan is an abstract
version of a forward-chaining domain-configurable
planner. Here s is the current state, G is the goal, π is
the current plan, D is the domain description, and
x is the auxiliary information used by the
search-control function acceptable(s, a, x, D). result(s, a)
is the state produced by applying the action a to s,
and progress(s, a, x, D) is the auxiliary information
to use in the next state. Here are two examples:
Procedure Controlled-Plan(s, G, π, x, D)
  if s ∈ G then return(π)
  actions ← {a | a is applicable to s
                 and acceptable(s, a, x, D) holds}
  if actions = ∅ then return(failure)
  nondeterministically choose a ∈ actions
  s′ ← result(s, a); π′ ← append(π, a)
  x′ ← progress(s, a, x, D)
  return(Controlled-Plan(s′, G, π′, x′, D))


Figure  1:  An  abstract  version  of  a  forward-chaining 
domain-configurable  classical  planner. 

• TLPlan (Bacchus & Kabanza 2000) maintains
a control formula written in a modal temporal
logic, and backtracks whenever the current control
formula evaluates to FALSE. TLPlan is an
instance of Controlled-Plan in which x is the current
control formula, D is the set of all actions
in the domain, progress(s, a, x, D) is the next
control formula generated by TLPlan’s
temporal-progression algorithm, and acceptable(s, a, x, D)
holds for all actions a applicable in s such that
the state result(s, a) satisfies progress(s, a, x, D).

• SHOP2 (Nau et al. 2003), an HTN planner, is an
instance of Controlled-Plan in which x is the current
task network and D contains all actions and
methods in the domain. acceptable(s, a, x, D)
holds for all actions a such that (i) a appears
in some task network x′ that is produced by
recursively decomposing tasks in x, and (ii) a has
no predecessors in x′. progress(s, a, x, D) is the
task network produced by removing a from x′.
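
To make the roles of acceptable and progress concrete, the following is a minimal Python sketch of Controlled-Plan, not the implementation of any of the planners above. It assumes a deterministic depth-first rendering in which the nondeterministic choice becomes backtracking, and it adds a depth bound as a safeguard. The counter domain, the UNDO table, and all function names are hypothetical and chosen only for illustration; here x simply remembers the previous action.

```python
def controlled_plan(s, goal, plan, x, domain, applicable, result,
                    acceptable, progress, depth_limit=20):
    """Depth-first rendering of Controlled-Plan (Fig. 1).

    The nondeterministic choice is simulated by trying each acceptable
    action in turn and backtracking on failure.
    """
    if s in goal:
        return plan
    if depth_limit == 0:
        return None  # failure (the depth bound is an added safeguard)
    actions = [a for a in domain
               if applicable(s, a) and acceptable(s, a, x, domain)]
    for a in actions:
        found = controlled_plan(result(s, a), goal, plan + [a],
                                progress(s, a, x, domain), domain,
                                applicable, result, acceptable, progress,
                                depth_limit - 1)
        if found is not None:
            return found
    return None  # failure


# Hypothetical toy domain: a counter that must be raised from 0 to 3.
DOMAIN = ["dec", "inc"]
UNDO = {"inc": "dec", "dec": "inc"}

def applicable(s, a):
    return s > 0 if a == "dec" else s < 5

def result(s, a):
    return s - 1 if a == "dec" else s + 1

def acceptable(s, a, x, domain):
    # x holds the previous action; reject any action that undoes it.
    return x is None or a != UNDO[x]

def progress(s, a, x, domain):
    return a  # auxiliary information to use in the next state

plan = controlled_plan(0, {3}, [], None, DOMAIN, applicable, result,
                       acceptable, progress)
print(plan)
```

In this toy run the pruning rule removes "dec" at every state after the first step, so the search never revisits a state.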

MDP-based  Planning.  We  consider  MDPs  of 
the  form  M  =  (S,  A,  app,  Pr,  C,  R),  where: 

•  S  and  A,  the  sets  of  states  and  actions,  are  finite. 

•  app(s)  is  the  set  of  all  actions  applicable  in  s. 

• For every a ∈ app(s), Pr(s, a, s′) is the probability
of the state transition (s, a, s′).

• For every s ∈ S and every a applicable to s,
C(s, a) > 0 is the cost of applying a to s.

• For every s ∈ S, R(s) ≥ 0 is the reward for s.

An MDP planning problem is a triple P =
(M, S0, G), where M = (S, A, app, Pr, C, R) is an
MDP, S0 ⊆ S is the set of initial states, and G ⊆ S
is the set of goal states.

For a state s in an MDP problem P = (M, S0, G),
we define results(s, a) = {s′ | Pr(s, a, s′) > 0} if
a ∈ app(s); results(s, a) = ∅ otherwise. We define
succ(s) = ∪{s′ ∈ results(s, a) | a ∈ app(s)}. We
say that s is a terminal state if s is a goal state (i.e.,
s ∈ G) or app(s) = ∅.

The value of a state s and of a state-action pair
(s, a) are defined recursively as follows:

$$V(s) = \begin{cases} R(s), & \text{if } s \text{ is terminal} \\ \max_{a \in \mathit{app}(s)} Q(s, a), & \text{otherwise} \end{cases} \tag{1}$$

$$Q(s, a) = R(s) - C(s, a) + \gamma \sum_{s' \in \mathit{results}(s, a)} \Pr(s, a, s')\, V(s') \tag{2}$$

where 0.0 < γ ≤ 1.0 is the discount factor. The
residual of a state s is defined as the difference
between the left and right sides of Eqn. 1.
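
As an illustration of Eqs. 1 and 2, the following Python sketch evaluates Q, the backed-up value of a state, and the residual on a tiny MDP. The two-state MDP, its numbers, and all names are hypothetical; the residual is taken here as an absolute difference, which is an assumption about how the difference is measured.

```python
GAMMA = 1.0

# Hypothetical MDP: from "s0", action "go" reaches the goal "g" with
# probability 0.8 and stays in "s0" with probability 0.2.
APP = {"s0": ["go"], "g": []}
PR = {("s0", "go"): {"g": 0.8, "s0": 0.2}}
R = {"s0": 0.0, "g": 500.0}
C = {("s0", "go"): 1.0}
GOALS = {"g"}

def is_terminal(s):
    return s in GOALS or not APP[s]

def q_value(s, a, V):
    # Eq. 2: immediate reward minus cost plus discounted expected value.
    expected = sum(p * V[s2] for s2, p in PR[(s, a)].items())
    return R[s] - C[(s, a)] + GAMMA * expected

def backup(s, V):
    # Right-hand side of Eq. 1.
    if is_terminal(s):
        return R[s]
    return max(q_value(s, a, V) for a in APP[s])

def residual(s, V):
    # Absolute difference between the two sides of Eq. 1 (assumption).
    return abs(V[s] - backup(s, V))

V0 = {"s0": 0.0, "g": 500.0}
print(q_value("s0", "go", V0), residual("s0", V0))
```

For this MDP the fixed point satisfies V(s0) = -1 + 0.8 * 500 + 0.2 * V(s0), i.e., V(s0) = 498.75, at which the residual of s0 vanishes.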

A policy is a partial function from S into A; i.e.,
π : S_π → A for some set S_π ⊆ S. The size of π is
|S_π|. V_π(s), the value of s induced by the policy π,
is defined similarly to V(s) except that the actions
applied to each state s are only the ones in π(s).

An optimal solution for an MDP planning problem
P = (M, S0, G) is a policy π such that when π
is executed in any of the initial states S0, it reaches
a goal state in G with probability 1, and for every
state s ∈ S0, there is no other policy π′ such that
V_π′(s) > V_π(s).

Note that the value function V defined in Eqn. 1
need not be a total function. Given an MDP
planning problem P = (M, S0, G) and an optimal
solution π for it, the set of states that are reachable
from the initial states S0 by using π constitutes a
minimal set of states over which the value function
V needs to be defined.

Researchers have often defined MDP versions of
classical planning problems. An MDP planning
problem P^F is an MDP version of a classical problem
P iff the two problems have the same states,
initial state, and goal, and if there is a one-to-one
mapping det from P^F’s actions to P’s actions
such that for every action a in P^F, a and det(a)
are applicable to exactly the same states, and for
each such state s, result(s, det(a)) ∈ results(s, a).
The additional states in results(s, a) can be used
to model various sources of uncertainty in the
domain, such as action failures (e.g., a robot gripper
may drop its load) and exogenous events (e.g.,
a road is closed). We call det(a) the deterministic
version of a, and a the MDP version of det(a).

Search  Control  for  MDPs 

We now describe how to incorporate the search-control
function acceptable of Fig. 1 into MDP planning
algorithms. Many MDP planning algorithms
can be viewed as forward-search procedures embedded
inside iteration loops.1 The forward-search
procedure starts at S0 and searches forward by applying
actions to states, computing a policy and/or
a set of utility values as the search progresses. The
iteration loop continues until some sort of convergence
criterion is satisfied (e.g., until two successive
iterations produce identical utility values for every
node, or until the residual of every node becomes
less than or equal to a termination criterion ε > 0).

1 For example, in Fig. 2, a forward-chaining version
of Value Iteration, the iteration loop is the outer while
loop, and the forward-search procedure is everything
inside that loop.


Procedure Fwd-VI
  select any initialization for V; π ← ∅
  while V has not converged do
    S ← S0; Visited ← ∅
    while S ≠ ∅ do
      for every state s ∈ S ∩ G, V(s) ← R(s)
      S ← S \ G; S′ ← ∅
      for every state s ∈ S
        for every a ∈ app(s)
          Q(s, a) ← (R(s) − C(s, a))
                    + γ Σ_{s′ ∈ results(s, a)} Pr(s, a, s′) V(s′)
          S′ ← S′ ∪ {s′ | s′ ∈ results(s, a)}
        V(s) ← max_{a ∈ app(s)} Q(s, a)
        π(s) ← argmax_{a ∈ app(s)} Q(s, a)
      Visited ← Visited ∪ S
      S ← S′ \ Visited
  return π


Figure 2: Fwd-VI, a forward-chaining version of
Value Iteration.

Planners  like  RTDP  and  LRTDP  fit  directly  into 
this  format.  These  planners  repeatedly  (1)  do  a 
greedy  search  going  forward  in  the  state  space, 
then  (2)  update  the  values  of  the  visited  states 
in a dynamic-programming fashion. Even the
well-known Value Iteration algorithm can be made to fit
into  the  above  format,  by  making  sure  to  compute 
the  values  in  a  forward-chaining  manner  (see  the 
Fwd-VI  algorithm  in  Fig.  2). 
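
The following Python sketch shows one way the forward-chaining scheme of Fig. 2 could be realized; it is an illustration under stated assumptions, not the Lisp implementation used in the experiments. It assumes that unseen states are initialized to 0.0, that convergence means the largest change in any visited state's value is at most eps, and that non-goal states with no applicable actions receive V(s) = R(s) as in Eqn. 1. The three-state MDP and all names are hypothetical.

```python
def fwd_vi(S0, goals, app, pr, R, C, gamma=1.0, eps=1e-8, max_iters=10_000):
    V, policy = {}, {}           # unseen states default to value 0.0
    for _ in range(max_iters):   # iteration loop
        max_residual = 0.0
        frontier, visited = set(S0), set()
        while frontier:          # forward-search procedure
            nxt = set()
            for s in frontier:
                acts = [] if s in goals else app(s)
                if not acts:     # terminal state: V(s) = R(s)
                    V[s] = R(s)
                    continue
                q = {}
                for a in acts:
                    outcomes = pr(s, a)  # dict: successor -> probability
                    q[a] = R(s) - C(s, a) + gamma * sum(
                        p * V.get(s2, 0.0) for s2, p in outcomes.items())
                    nxt.update(outcomes)
                best = max(q, key=q.get)
                max_residual = max(max_residual,
                                   abs(V.get(s, 0.0) - q[best]))
                V[s], policy[s] = q[best], best
            visited |= frontier
            frontier = nxt - visited
        if max_residual <= eps:
            break
    return policy, V


# Hypothetical MDP: s0 -> s1 -> g, where "go" succeeds with probability
# 0.8 and otherwise stays put; "detour" is a costlier but certain move.
GOALS = {"g"}

def app(s):
    return {"s0": ["go", "detour"], "s1": ["go"]}.get(s, [])

def pr(s, a):
    if a == "detour":
        return {"s1": 1.0}
    nxt = {"s0": "s1", "s1": "g"}[s]
    return {nxt: 0.8, s: 0.2}

def R(s):
    return 500.0 if s in GOALS else 0.0

def C(s, a):
    return 5.0 if a == "detour" else 1.0

policy, V = fwd_vi({"s0"}, GOALS, app, pr, R, C)
print(policy, V)
```

Note that the only place the sketch consults the model for available actions is the call to app(s), which is the hook that the search-control modification described next relies on.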

During  the  forward  search,  at  each  state  s  that 
the  planner  visits,  it  needs  to  know  app(s),  the  set 
of all actions applicable to s. For example, the
inner for loop of Fwd-VI iterates over the actions in
app(s), and RTDP and LRTDP choose whichever
action in app(s) currently has the best value.

Let Z be a forward-chaining MDP planning algorithm,
F be an instance of Controlled-Plan (see Fig.
1), and acceptable_F be F’s search-control function.
We will now define Z^F, a modified version of Z in
which every occurrence of app(s) is replaced by

{a ∈ app(s) | acceptable_F(s, det(a), x, D) holds}.

We require Z to be a forward-chaining MDP algorithm
because the auxiliary information x is
computed by progression from s’s parent.
Here are some examples of Z^F:

• Fwd-VI^TLPlan is the forward-chaining version of
Value Iteration combined with TLPlan’s search-control
function,

• RTDP^TALplanner is RTDP combined with
TALplanner’s search-control function,

• LRTDP^SHOP2 is LRTDP combined with SHOP2’s
search-control function.
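
The substitution that defines Z^F can be sketched in Python as a wrapper around app. This is only an illustration: the helper name make_app_f, the action names, and the DET table are hypothetical, and the caller is assumed to supply the auxiliary information x that was progressed from the parent state.

```python
def make_app_f(app, acceptable, det):
    """Build the filtered action function that Z^F uses in place of app."""
    def app_f(s, x, D):
        return [a for a in app(s) if acceptable(s, det(a), x, D)]
    return app_f


# Hypothetical MDP actions and their deterministic versions.
DET = {"move-prob": "move", "wait-prob": "wait"}

def app(s):
    return ["move-prob", "wait-prob"]

def det(a):
    return DET[a]

def acceptable(s, a, x, D):
    # Toy search-control rule: never consider waiting.
    return a != "wait"

app_f = make_app_f(app, acceptable, det)
print(app_f("s0", None, None))
```

The planning algorithm itself is left unchanged; only the set of actions it iterates over at each state shrinks.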


Formal  Properties 

Let Z be a forward-chaining MDP planning algorithm
that is guaranteed to return an optimal
solution if one exists, F be an instance of
Controlled-Plan, and acceptable_F be F’s search-control
function. Suppose M = (S, A, app, Pr, C, R) is
an MDP and P = (M, S0, G) is a planning problem
over M. Then we can define the reduced MDP
M^F and planning problem P^F as follows:


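$$\begin{aligned}
\mathit{app}^F(s) &= \{a \in \mathit{app}(s) \mid \mathit{acceptable}_F(s, \mathit{det}(a), x, D) \text{ holds}\}; \\
\mathit{results}^F(s, a) &= \begin{cases} \mathit{results}(s, a), & \text{if } a \in \mathit{app}^F(s), \\ \emptyset, & \text{otherwise}; \end{cases} \\
\mathit{succ}^F(s) &= \bigcup \{s' \in \mathit{results}^F(s, a) \mid a \in \mathit{app}^F(s)\}; \\
S^F &= \text{transitive closure of } \mathit{succ}^F \text{ over } S_0; \\
G^F &= G \cap S^F; \\
M^F &= (S^F, A, \mathit{app}^F, \Pr, C, R); \\
P^F &= (M^F, S_0, G^F).
\end{aligned}$$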


Recall that in every place where the algorithm
Z uses app(s), the algorithm Z^F instead uses {a ∈
app(s) | acceptable_F(s, det(a), x, D) holds}. Thus
from the above definitions, it follows that running
Z^F on P is equivalent to running Z on P^F.
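
The state set S^F of the reduced problem can be computed by a standard breadth-first closure, as in the Python sketch below. The sketch makes a simplifying assumption that is not required by the definitions above: the search-control rule depends only on the state, so the auxiliary information x is ignored. The ten-state line world and all names are hypothetical.

```python
from collections import deque

def reachable(S0, app, results):
    """Transitive closure of the successor relation over S0."""
    seen, queue = set(S0), deque(S0)
    while queue:
        s = queue.popleft()
        for a in app(s):
            for s2 in results(s, a):
                if s2 not in seen:
                    seen.add(s2)
                    queue.append(s2)
    return seen


# Hypothetical line world with states 0..9; each move may fail and
# leave the state unchanged.
def app(s):
    return (["right"] if s < 9 else []) + (["left"] if s > 0 else [])

def results(s, a):
    return {s + 1, s} if a == "right" else {s - 1, s}

def app_f(s):
    # Toy search control: only move right, and stop once state 3 is reached.
    return [a for a in app(s) if a == "right" and s < 3]

full_states = reachable({0}, app, results)
reduced_states = reachable({0}, app_f, results)
print(len(full_states), sorted(reduced_states))
```

Running an unmodified planner over the reduced state set then corresponds to running Z on P^F.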

We say that acceptable_F is admissible for P if
for every state s in P, there is an action a ∈
app(s) such that acceptable_F(s, det(a), x, D) holds
and V(s) = Q(s, a), where V(s) and Q(s, a) are as
in Eqs. 1-2. From this we get the following:

Theorem 1 Suppose Z returns a policy π for P.
Then Z^F returns a policy π′ for P such that
V_π(s) = V_π′(s) for every s ∈ S0, if acceptable_F
is admissible for P.

Next, we consider the computational complexity
of Z and Z^F. This depends heavily on the search
space. If P = (M, S0, G) is a planning problem
over an MDP M = (S, A, app, Pr, C, R), then the
search space for Fwd-VI on P is a digraph Γ_P =
(N, E), where N is the transitive closure of succ
over S0, and E = {(s, s′) | s ∈ N, s′ ∈ succ(s)}.
For algorithms like RTDP and LRTDP, the search
space is a subgraph of Γ_P.

If F is an instance of Controlled-Plan, then the
search space for Fwd-VI^F is Γ_P^F = (S^F, E^F), where
S^F is as defined earlier, and E^F = {(s, s′) | s ∈
S^F, s′ ∈ succ^F(s)}.

The worst case is where Γ_P = Γ_P^F; this happens if
F’s search-control function, acceptable_F, does not
remove any actions from the search space.

On the other hand, there are many planning
problems in which acceptable_F will remove a large
number of applicable actions at each state in the
search space (some examples occur in the next section).
In such cases, acceptable_F can produce an
exponential speedup, as illustrated in the following
simple example.

Suppose Γ_P is a tree in which every state at
depth d is a goal state, and for every state of depth
< d, there are exactly b applicable actions and each
of those actions has exactly k possible outcomes.
Then Γ_P’s branching factor is bk, so the number
of nodes is O((bk)^d). Next, suppose F’s search-control
function eliminates exactly half of the actions
at each state. Then Γ_P^F is a tree of depth d and
branching factor (b/2)k, so it contains O(((b/2)k)^d)
nodes. In this case, the ratio between the number
of nodes visited by Fwd-VI and Fwd-VI^F grows as 2^d, so
Fwd-VI^F is exponentially faster than Fwd-VI.
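
The arithmetic of this example can be checked with a few lines of Python. The values b = 4, k = 2, and d = 10 are arbitrary choices made only for illustration.

```python
def tree_nodes(branching, depth):
    """Total number of nodes in a complete tree of the given depth."""
    return sum(branching ** i for i in range(depth + 1))

b, k, d = 4, 2, 10                            # illustrative values only

full_leaves = (b * k) ** d                    # leaves of the unpruned tree
pruned_leaves = ((b // 2) * k) ** d           # leaves after halving the actions
full_nodes = tree_nodes(b * k, d)
pruned_nodes = tree_nodes((b // 2) * k, d)

print(full_leaves // pruned_leaves)           # ratio of leaf counts
print(full_nodes / pruned_nodes)              # ratio of total node counts
```

The leaf counts differ by exactly a factor of 2^d; the total node counts differ by a factor of the same order, because the leaves dominate both trees.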

Experimental  Evaluation 

For our experiments, we used Fwd-VI, RTDP, and
LRTDP, and their enhanced versions Fwd-VI^SHOP2,
RTDP^SHOP2, and LRTDP^SHOP2. For meaningful
tests of the enhanced algorithms, we needed planning
problems with much bigger state spaces than
in prior published tests of RTDP and LRTDP. For
this purpose, we chose the following two domains.

One was the Probabilistic Blocks World (PBW)
from the 2004 International Probabilistic Planning
Competition, with a 15% probability that a pickup
or putdown action would drop the block on the table.
The size of the state space grows combinatorially
with the number of blocks: with 3 blocks
there are only 13 states, but with 10 blocks there
are 58,941,091 states.
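
These state counts coincide with the number of ways to arrange n labeled blocks into stacks on a table. The Python sketch below computes that quantity; treating it as the PBW state count is an assumption on our part (it counts only configurations in which no block is being held), although it does reproduce the figures quoted in this paper.

```python
from math import comb, factorial

def blocks_world_states(n):
    """Number of ways to arrange n labeled blocks into k = 1..n stacks."""
    return sum(comb(n - 1, k - 1) * factorial(n) // factorial(k)
               for k in range(1, n + 1))

counts = {n: blocks_world_states(n) for n in (3, 6, 10)}
print(counts)
print(counts[10] / counts[6])   # growth from 6 blocks to 10 blocks
```

Under this counting, going from 6 to 10 blocks enlarges the state space by a factor of roughly 14,500.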

The  other  was  an  MDP  adaptation  of  the  Robot 
Navigation  domain  (Kabanza,  Barbeau,  &  St-Denis 
1997;  Pistore,  Bettin,  &  Traverso  2001).  A  building 
has  8  rooms  and  7  doors.  Some  of  the  doors  are 
called  kid  doors.  Whenever  a  kid  door  is  open,  a 
“kid”  can  close  it  randomly  with  a  probability  of 
0.5.  If  the  robot  tries  to  open  a  closed  kid  door,  this 
action  may  fail  with  a  probability  of  0.5  because 
the  kid  immediately  closes  the  door.  Packages  are 
distributed  throughout  the  rooms,  and  need  to  be 
taken  to  other  rooms.  The  robot  can  carry  only 
one  package  at  a  time.  When  there  are  5  packages, 
the  state  space  contains  54,525,952  states. 

In both domains, we used a reward of 500 for
goal states and 0 for all other states, a cost of 1
for each action, a discount factor γ = 1.0, and a
termination criterion ε = 10^-8.

The RTDP and LRTDP algorithms use domain-independent
heuristics to initialize their value functions.
We used two such heuristics. One, h500,
initializes the value of every state to 500. The other,
hmax, is Bonet & Geffner’s (2003) hmin heuristic
adapted to work on maximization problems:

$$Q(s, a) = R(s) - C(s, a) + \max_{s' \in \mathit{results}(s, a)} \Pr(s, a, s')\, V(s') \tag{3}$$


Figure 3: Running times for PBW using h500, plotted
on a semi-log scale. With 6 blocks (b = 6), the
modified algorithms are about 10,000 times as fast
as the original ones.


Figure 4: Running times for PBW using hmax, plotted
on a semi-log scale. As before, when b = 6 the
modified algorithms are about 10,000 times as fast
as the original ones.

We implemented all six of the planners in Lisp,2
and tested them on an HP Pavilion N5415 with
256MB memory running Linux Fedora Core 2.

On most of the problems, Fwd-VI failed due to
memory overflows, so we do not report any results
for it. Figures 3 and 4 show the average run times
of the other five planners in the PBW domain.
Each run time includes the time needed to compute
h500 or hmax, and each data point is the average
of 20 runs. The run times for RTDP and
LRTDP were almost the same, and so were those
of the three enhanced planners. Every algorithm’s
run time grows exponentially, but the growth rate
is much smaller for the enhanced algorithms than
for the original ones; for example, at b = 6 they
have about 1/10,000 of the run time of the original
algorithms. Once we got above 6 blocks (4,051
states in the state space), the original algorithms
ran out of memory.3 In contrast, the modified
algorithms could easily have handled problems with
more than 10 blocks (more than 58,941,091 states).

2 The authors of RTDP and LRTDP were willing to
let us use their C++ implementations, but we needed
Lisp in order to use SHOP2’s search-control mechanism.


Table 1: Run times using h500 on Robot-Navigation
problems with one kid door; p is the number of packages.
Each data point is the average of 20 problems.

                p = 1      p = 2    p = 3    p = 4    p = 5
RTDP           10.213    254.642      -        -        -
LRTDP          11.804   1622.679      -        -        -
Fwd-VI^SHOP2    0.009      0.026    0.056    0.076    0.137
RTDP^SHOP2      0.01       0.023    0.046    0.069    0.11
LRTDP^SHOP2     0.016      0.031    0.063    0.088    0.167
The  reason  for  the  fast  performance  of  the  mod¬ 
ified  algorithms  is  that  in  an  HTN  planner  like 
SHOP2,  it  is  very  easy  to  specify  domain-specific 
(but  problem-independent)  strategies  like  “if  there 
is  a  clear  block  that  you  can  move  to  a  place  where 
it  will  never  need  to  be  moved  again,  then  do  so 
without  considering  any  other  actions,”  and  “if  you 
drop  a  block  on  the  table,  then  pick  it  up  again  im¬ 
mediately.”  Such  strategies  reduce  the  size  of  the 
search  space  tremendously. 

Tables 1 and 2 show the running times for the
planners in the Robot Navigation domain. The
times for RTDP and LRTDP grew quite rapidly, and
they were unable to solve many of the problems at
all because of memory overflows. RTDP^SHOP2 and
LRTDP^SHOP2 had no memory problems, and their
running times were quite small.

An explanation of the performance of RTDP
and LRTDP in our experiments is in order.
LRTDP uses a labeling mechanism to
mark states whose values have converged so that
the algorithm does not visit them again during the
search process. In cases where the value of a state
does not converge until towards the end of planning,
labeling states does not help much to improve
performance. We observed in our experiments that
LRTDP spent significant time attempting to
label the states it visited. On the other hand, RTDP
was free from such overhead, and it was able to
perform better than LRTDP on our problems.


3 Each time RTDP or LRTDP had a memory overflow,
we ran it again on another problem of the same
size. We omitted each data point on which there were
more than five memory overflows. Thus our data make
the performance of RTDP and LRTDP look better than
it really was, but this makes little difference since
they performed so much worse than RTDP^SHOP2 and
LRTDP^SHOP2.


Table 2: Run times using hmax on Robot-Navigation
problems with one kid door; p is the number of packages.
Each data point is the average of 20 problems.

                p = 1      p = 2    p = 3    p = 4    p = 5
RTDP           23.847    629.458      -        -        -
LRTDP          15.078    383.173      -        -        -
Fwd-VI^SHOP2    0.011      0.034    0.085    0.136    0.251
RTDP^SHOP2      0.01       0.031    0.082    0.125    0.224
LRTDP^SHOP2     0.011      0.040    0.089    0.141    0.258

Related  Work 

In  addition  to  the  RTDP  and  LRTDP  algorithms 
described  earlier,  another  similar  algorithm  is  LAO* 
(Hansen  &  Zilberstein  2001),  which  is  based  on  the 
classical  AO*  search  algorithm.  LAO*  can  do  Value 
Iteration  or  Policy  Iteration  in  order  to  update  the 
values  of  the  states  in  the  search  space,  and  can 
generate  optimal  policies  under  certain  conditions. 

Domain-specific  knowledge  has  been  used  to 
search  MDPs  in  reinforcement  learning  research 
(Parr  1998;  Dietterich  2000).  These  approaches  are 
based  on  hierarchical  abstraction  techniques  that 
are  somewhat  similar  to  HTN  planning.  Given  an 
MDP,  the  hierarchical  abstraction  of  the  MDP  is 
analogous  to  an  instance  of  the  decomposition  tree 
that  an  HTN  planner  might  generate.  However, 
the  abstractions  must  be  supplied  in  advance  by 
the  user,  rather  than  being  generated  on-the-fly  by 
the  HTN  planner. 

In envelope-based MDP planning (Dean et al.
1995), an envelope of an MDP M is a smaller MDP
M′ ⊆ M. Envelope-based planning algorithms are
anytime  algorithms.  An  envelope-based  planner 
begins  with  an  initial  envelope,  and  computes  an 
optimal  policy  for  this  envelope.  On  subsequent 
iterations,  it  does  the  same  thing  on  larger  and 
larger  envelopes,  stopping  when  time  runs  out  or 
when  the  current  envelope  contains  all  of  M.  The 
initial envelope is typically generated by a search
algorithm, which often is a classical planning
algorithm such as Graphplan. This suggests that
domain-configurable planning algorithms such as
TLPlan and SHOP2 would be good candidates for
the search algorithm, but we do not know of any
case where they have been tried.

Our ideas in this paper, and in particular the
notion of a search-control function, were influenced
by the work of Kuter & Nau (2004), in which they
described a way to generalize a class of classical
planners and their search-control functions for use
in nondeterministic planning domains, where the
actions have nondeterministic effects but no
probabilities for state transitions are known.

Conclusions  and  Future  Work 

In this paper, we have described a way to take
any forward-chaining MDP planner and modify
it to include the search-control algorithm from a
forward-chaining domain-configurable planner such
as TLPlan, SHOP2, or TALplanner.

If the search-control algorithm satisfies an
“admissibility” condition, then the modified MDP
planner is guaranteed to find optimal solutions. If
the search-control algorithm generates a smaller set
of actions at each node than the original MDP
algorithm did, then the modified planner will run
exponentially faster than the original one.

To evaluate our approach experimentally, we
have taken the search-control algorithm from the
SHOP2 planner (Nau et al. 2003), and incorporated
it into three MDP planners: RTDP (Bonet
& Geffner 2000), LRTDP (Bonet & Geffner 2003),
and Fwd-VI, a forward-chaining version of Value
Iteration (Bertsekas 1995). We have tested the
performance of the modified algorithms in two MDP
planning domains. For the original planning
algorithms, the running time and memory requirements
grew very quickly as a function of problem size, and
the planners ran out of memory on many of the
problems. In contrast, the modified planning
algorithms had no memory problems and they solved
all of the problems very quickly.

In the near future, we are planning to do
additional theoretical and experimental analyses of
our technique. We also are interested in extending
the technique to MDPs that have continuous
state spaces; such state spaces often arise in fields
like control theory and operations research. We
are currently working on that problem jointly with
researchers who specialize in the fields of control
theory and operations research.